Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
Authors
Abstract
Gradient-based policy optimization algorithms suffer from high gradient variance, which is usually the result of using Monte Carlo estimates of the Q-value function in the gradient calculation. Replacing this estimate with a function approximator over the state-action space can reduce the gradient variance significantly. In this paper we present a method for training a Gaussian process to approximate the action-value function, which can then replace the Monte Carlo estimate in the policy gradient evaluation. We also give an iterative formulation of the algorithm to make it better suited to online learning.
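As a rough illustration of the idea (not the paper's exact algorithm), the sketch below fits a Gaussian-process regressor from state-action pairs to sampled returns and plugs its posterior mean into a REINFORCE-style gradient in place of the raw Monte Carlo return. The toy one-dimensional task, the Gaussian policy, and all hyperparameters are illustrative assumptions.

```python
# Sketch: GP posterior mean over (state, action) replaces Monte Carlo
# returns in a score-function policy gradient.  Toy setup, not the paper's
# exact method.
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

class GPQ:
    """GP regression from (state, action) pairs to sampled returns;
    the posterior mean serves as the Q-value approximation."""
    def __init__(self, noise=0.1):
        self.noise = noise

    def fit(self, SA, returns):
        self.SA = SA
        K = rbf_kernel(SA, SA) + self.noise**2 * np.eye(len(SA))
        L = np.linalg.cholesky(K)
        self.alpha = np.linalg.solve(L.T, np.linalg.solve(L, returns))

    def mean(self, SA_new):
        return rbf_kernel(SA_new, self.SA) @ self.alpha

rng = np.random.default_rng(0)
theta, sigma = 0.5, 0.3                  # Gaussian policy pi(a|s) = N(theta*s, sigma^2)
states = rng.normal(size=(200, 1))
actions = theta * states + sigma * rng.normal(size=(200, 1))
returns = (-(actions - 1.0) ** 2).ravel() + 0.1 * rng.normal(size=200)  # noisy returns

gp = GPQ()
SA = np.hstack([states, actions])
gp.fit(SA, returns)
q_hat = gp.mean(SA)

# Score function of the Gaussian policy: d/dtheta log pi(a|s) = (a - theta*s)*s / sigma^2.
grad_log_pi = ((actions - theta * states) * states / sigma**2).ravel()
grad_mc = np.mean(grad_log_pi * returns)   # Monte Carlo return as Q
grad_gp = np.mean(grad_log_pi * q_hat)     # GP posterior mean as Q
print(f"MC gradient: {grad_mc:+.3f}   GP-smoothed gradient: {grad_gp:+.3f}")
```

Because the GP posterior mean pools information across nearby state-action pairs, the resulting gradient estimate is typically much smoother than the raw Monte Carlo version.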
Similar Papers
Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms
Policy gradient (PG) reinforcement learning algorithms have strong (local) convergence guarantees, but their learning performance is typically limited by a large variance in the estimate of the gradient. In this paper, we formulate the variance reduction problem by describing a signal-to-noise ratio (SNR) for policy gradient algorithms, and evaluate this SNR carefully for the popular Weight Per...
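A quick way to see what an SNR measures in this setting: draw many independent gradient estimates on a toy problem and take the ratio of their mean to their standard deviation. The bandit task and batch size below are illustrative assumptions, and the estimator is a generic score-function gradient rather than the paper's weight-perturbation analysis.

```python
# Sketch: empirical SNR of a policy gradient estimator, measured over
# independent sample batches.  Toy 1-D Gaussian-policy bandit, illustrative.
import numpy as np

rng = np.random.default_rng(1)

def gradient_sample(theta, batch=32):
    """One REINFORCE gradient estimate for pi = N(theta, 1), reward -(a-2)^2."""
    a = theta + rng.normal(size=batch)
    r = -(a - 2.0) ** 2
    return np.mean((a - theta) * r)        # score-function estimator

grads = np.array([gradient_sample(0.0) for _ in range(1000)])
snr = np.abs(grads.mean()) / grads.std()   # signal (mean) over noise (std)
print(f"gradient SNR over 1000 batches: {snr:.2f}")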
Bayesian Policy Gradient Algorithms
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian fra...
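A minimal sketch of the Bayesian flavor, using Bayesian quadrature (a building block of such frameworks, not necessarily the paper's exact construction): instead of averaging samples, place a GP prior on the integrand and report the posterior mean of the integral. The RBF kernel, the standard-normal sampling distribution, and the 1-D integrand are all illustrative assumptions.

```python
# Sketch: Bayesian quadrature vs. plain Monte Carlo for E_p[f], p = N(0, 1).
import numpy as np

rng = np.random.default_rng(2)
l, v, noise = 1.0, 1.0, 1e-3                   # RBF kernel hyperparameters (illustrative)

f = lambda x: np.sin(x) + 0.5 * x              # integrand; true E[f] = 0 under N(0, 1)
x = rng.normal(size=20)                        # samples from p = N(0, 1)
y = f(x)

# GP prior on f with kernel k(x, x') = v * exp(-(x - x')^2 / (2 l^2)).
K = v * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / l**2)
# z_i = integral of k(x, x_i) under p, in closed form for RBF kernel + Gaussian p.
z = v * l / np.sqrt(l**2 + 1) * np.exp(-0.5 * x**2 / (l**2 + 1))

bq = z @ np.linalg.solve(K + noise * np.eye(len(x)), y)   # posterior mean of the integral
mc = y.mean()                                             # plain Monte Carlo estimate
print(f"Bayesian quadrature: {bq:+.4f}   Monte Carlo: {mc:+.4f}   (true 0)")
```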
Mean Actor Critic
We propose a new algorithm, Mean Actor-Critic (MAC), for discrete-action continuous-state reinforcement learning. MAC is a policy gradient algorithm that uses the agent’s explicit representation of all action values to estimate the gradient of the policy, rather than using only the actions that were actually executed. This significantly reduces variance in the gradient updates and removes the n...
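To make the distinction concrete, here is a toy sketch (single state, softmax policy, made-up critic values, all illustrative) comparing the usual sampled-action estimator against a MAC-style sum over all actions:

```python
# Sketch: MAC-style gradient sums over all discrete actions weighted by
# their probabilities, instead of using only the one sampled action.
import numpy as np

rng = np.random.default_rng(3)
n_actions = 4
theta = np.zeros(n_actions)                    # softmax logits for one state
Q = np.array([1.0, 0.2, -0.5, 0.8])            # critic's action-value estimates

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)

# Sampled-action estimator: grad log pi(a) * Q(a) for one drawn action.
# For softmax logits, grad log pi(a) = e_a - pi.
a = rng.choice(n_actions, p=pi)
grad_sampled = (np.eye(n_actions)[a] - pi) * Q[a]

# MAC-style estimator: sum_a pi(a) * grad log pi(a) * Q(a), no action sampling.
grad_mac = sum(pi[b] * (np.eye(n_actions)[b] - pi) * Q[b]
               for b in range(n_actions))
print("sampled:", grad_sampled, "\nMAC:    ", grad_mac)
```

Averaging over the full action distribution removes action sampling as a source of variance, which is exactly the effect the abstract describes.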
Regularized Policy Gradients: Direct Variance Reduction in Policy Gradient Estimation
Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces; they update the policy parameters along the steepest ascent direction of the expected return. However, the large variance of policy gradient estimates often causes instability in the policy update. In this paper, we propose to suppress the variance of gradient estimation by directly employing the var...
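A crude, hedged stand-in for the idea (not the paper's exact regularizer): compute per-sample gradient estimates on a toy Gaussian-policy bandit and damp the update by their empirical variance, so high-variance estimates translate into more conservative steps. All constants are illustrative assumptions.

```python
# Sketch: variance-aware policy update on a toy bandit.  The damping term
# is a crude stand-in for direct variance regularization, not the paper's method.
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, lam, lr = 0.0, 0.5, 0.05, 0.1

for step in range(200):
    a = theta + sigma * rng.normal(size=64)     # actions from pi = N(theta, sigma^2)
    r = -(a - 1.0) ** 2                         # reward, maximized at a = 1
    g = (a - theta) / sigma**2 * r              # per-sample score-function gradients
    # Damp the step size when the gradient estimator is noisy.
    theta += lr * g.mean() / (1.0 + lam * g.var())

print(f"theta after training: {theta:.3f} (optimum at 1.0)")
```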
Geometric Variance Reduction in Markov Chains. Application to Value Function and Gradient Estimation
We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(...
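The generic building block behind such schemes is the control variate: subtract a correlated statistic with known expectation from the Monte Carlo estimate. Below is a minimal sketch on an illustrative AR(1) chain, not the paper's sequential construction.

```python
# Sketch: a single control variate reducing Monte Carlo variance for a
# functional of a Markov chain.  AR(1) chain and functionals are illustrative.
import numpy as np

rng = np.random.default_rng(5)
rho = 0.9                                      # AR(1) mixing coefficient

def chain(n):
    """Stationary AR(1) chain with N(0, 1) marginal distribution."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.normal()
    return x

x = chain(5000)
f = np.exp(x)        # functional of interest; true E[f] = sqrt(e) ~ 1.6487
h = x                # control variate with known mean E[h] = 0

c = np.cov(f, h)[0, 1] / np.var(h)             # near-optimal coefficient
plain = f.mean()
controlled = (f - c * h).mean()
print(f"plain MC: {plain:.4f}   with control variate: {controlled:.4f}   (true 1.6487)")
```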